Identification of clusters, communities or niches in spatial transcriptomic data

Cluster identification is similar to grouping similar things together. Say, if you have a basket full of different fruits, you would group apples with apples, bananas with bananas, and so on. In spatial transcriptomics, we are doing the same thing, but with complex gene expression data. The objective is to group together cells (or spatial locations) that exhibit similar gene expression profiles. These ‘clusters’ can provide insight into different cell types, states, or functional regions within the tissue, ultimately enabling researchers to gain a detailed understanding of biological processes and diseases. Further we can do clustering based on the gene expression, based on the spatial location or both together. I will guide you through the different options of spatial clustering and how we can use this for data integration.

Cluster by gene-expression

Shared Nearest Neighbor (SNN)

We commence the second part of our workshop with a simple and automated clustering method using SPATA2 reprocessing. Our chosen algorithm for this process is the Shared Nearest Neighbor (SNN) clustering, implemented through Seurat. Here, we’ll also focus on understanding and interpreting clusters, and strive to define associated biological processes.

Let’s briefly discuss the SNN concept. This approach only considers gene expression in specific spots while disregarding their spatial positions: Shared Nearest Neighbor (SNN) clustering is a method used to group or cluster data points that are very similar to each other. Imagine you are in a room full of people. Now, suppose you want to make groups of people who share common interests. How would you go about it? You might start by asking each person about their top 5 interests. Those who share the most common interests with each other would form a group, or in other words, a cluster. The SNN algorithm works in a similar way, but with data points instead of people and gene expression profiles instead of interests. Instead of comparing all possible pairs of data points (which can be a very large number), each data point looks at its ‘nearest neighbors’ – these are the data points that have the most similar gene expression profiles. For each pair of data points, the SNN algorithm calculates how many ‘neighbors’ they share. If they have a lot of neighbors in common, they are considered to be very similar and are grouped together into the same cluster. So in simple terms, SNN clustering is like making groups of data points that have a lot of common ‘friends’, where ‘friends’ are defined as having similar gene expression profiles.

In high-dimensional data analysis we represent the data using UMAPS or other methods for dimensional reduction to simplyfy the data. This comes with disadvantages, for example the distance of spots represented in a UMAP are not meaningfull when it comes to interpretation of the similarity across clusters for example.

# Like every workshop we start with loading a SPATA2 object

object <- readRDS("path/to/Test_object.RDS")

Lets plot the spatial distribution of the SNN clustering and the UMAP presentation

p1 <- 
  SPATA2::plotSurface(object, color_by="seurat_clusters", pt_alpha=0.6, display_image = T)+
  SPATA2::ggpLayerAxesSI(object)+
  ggpLayerThemeCoords(unit = "mm")+
  ggtitle("SNN Cluster")+
  scale_color_manual(values=RColorBrewer::brewer.pal(8, "Set3"))+
  Seurat::NoLegend()

plot <- SPATA2::plotUmap(object, color_by = "seurat_clusters", pt_size = 0.5)
library(scattermore)

p2 <- 
  ggplot(data=plot$data)+
  geom_scattermore(mapping=aes(umap1,umap2), pointsize = 8, color="black")+
  geom_scattermore(mapping=aes(umap1,umap2), pointsize = 7, color="white")+
  geom_point(mapping=aes(x=umap1, y=umap2, color=seurat_clusters),size=0.5)+
  scale_color_manual(values=RColorBrewer::brewer.pal(8, "Set3"))+
  theme_bw() +
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_rect(colour = "black", size=0.5),
          axis.text.x = element_text(colour="black",
                                     angle = 90, vjust = 0.5, hjust=1), 
          axis.text.y = element_text(colour="black"))+
  ggtitle("UMAP Representation")+
  coord_fixed(ratio = 2)

p1+p2

As presented by both plots, SNN clustering identified 8 different clusters, at least 4 of them appeared to be spatially segregated without overlap. This means that we have regions in our sample that are transcriptionally unique in distinct niches (without integration the space into our analysis). Having these information, I would be interesting to get a bit more information of each of the groups. Since we have defined the groups, we can test which genes are differently expressed in the samples:

object <- SPATA2::runDEA(object, across="seurat_clusters")

## Get the differently expressed genes:
tab <- SPATA2::getDeaResultsDf(object, across="seurat_clusters")
tab %>% head()

# Lets do same examples and get the 10 top expressed genes of cluster 5 (argument: avg_log2FC)
tab %>% 
  filter(seurat_clusters==5) %>% 
  arrange(desc(avg_log2FC)) %>% 
  head(10) %>% 
  pull(gene)

For a simple analysis we can use ChatGPT and ask about the annotation of gene functions:

ChatGPT prompt:

Reset

The list of genes provided includes "MT1X," "VEGFA," "ADM," "IGFBP5," "GAPDH," "LGALS3," "CHI3L1," "MT2A," "HILPDA," and "DDIT3." Your task is to provide a concise summary of the functions of these genes and identify any shared characteristics or 'signatures' they may collectively exhibit. Please present your findings in the form of a paragraph appropriate for the results section of a scientific research paper.

The answer:

Our investigation focused on the analysis of ten specific genes: MT1X, VEGFA, ADM, IGFBP5, GAPDH, LGALS3, CHI3L1, MT2A, HILPDA, and DDIT3. The roles these genes play in cellular processes are multifaceted. Both MT1X and MT2A belong to the metallothionein family and are involved in the binding of heavy metals, providing protection against metal toxicity. VEGFA is crucial in the process of angiogenesis, instigating the formation of new blood vessels. ADM has roles in several physiological processes such as vasodilation and hormone secretion. IGFBP5 is part of the insulin-like growth factor (IGF) system and influences cell growth and development. GAPDH, predominantly known for its role in glycolysis, is also implicated in various other cellular processes like apoptosis and DNA repair. LGALS3 is involved in the modulation of apoptosis and cell proliferation. CHI3L1 is involved in inflammation and tissue remodeling, while HILPDA, a hypoxia-inducible gene, regulates lipid metabolism. DDIT3, induced by various cellular stresses, is integral to the initiation of apoptosis. Upon scrutiny of these genes, a common theme regarding stress responses and regulatory functions was observed. Many of these genes play vital roles in response to stress-related events, for instance, hypoxia (HILPDA), cellular stress (DDIT3), and defense against toxic heavy metals (MT1X, MT2A). Additionally, several genes contribute to the regulation of cell growth, proliferation, and apoptosis (VEGFA, IGFBP5, GAPDH, LGALS3). While the functions of the genes individually are diverse, their collective impact suggests a significant involvement in cellular stress management and regulation of critical biological processes. These shared characteristics hint at key mechanisms involved in cellular resilience and adaptive capabilities.

That’s helpful, but it would be also helpful to validate the functions based on a analysis. This analysis called Gene Ontology Analysis.

library(DOSE)
library(enrichplot)
library(clusterProfiler)


# First we create a list of the top 50 genes per cluster
list_genes <- map(0:8, .f=function(i){
  tab %>% 
  filter(seurat_clusters==i) %>% 
  arrange(desc(avg_log2FC)) %>% 
  head(50) %>% 
  pull(gene)
})

names(list_genes) <- paste0("Cluster_", 0:7)

# Run the Gene Ontology Analysis of the different clusters
CC <- compareCluster(list_genes, 
                     fun="enrichGO",
                     ont ="BP", 
                     keyType = "SYMBOL", 
                     OrgDb = "org.Hs.eg.db")
edo <- pairwise_termsim(CC)

## Plot the comparison
col <- colorRampPalette(c(RColorBrewer::brewer.pal(9, "Greens"))[4:9])

enrichplot::dotplot(edo, showCategory=3)+
  scale_color_gradientn(colours = col(50), name="")+
  guides(color = guide_colourbar(barwidth = 0.3, barheight = 8, 
                                 ticks =F, frame.colour="black"), label=F, size="none")+
  ggtitle("Gene Ontology")+
  xlab("")+ylab("")+
  theme_bw() +
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_rect(colour = "black", size=0.5),
          axis.text.x = element_text(colour="black",
                                     angle = 90, vjust = 0.5, hjust=0.5 ,size=5), 
          axis.text.y = element_text(colour="black", size=5))

## Get a Network representation of the enriched pathways in all cluster

#Set the random start point to get allways the same plots
set.seed(200)
plot <- 
  emapplot(edo, showCategory=50, cex_category=0.01, 
           pie.params = list(pie = "equal"), 
           cex_line = 0.3,
           layout="graphopt", min_edge=0.15, repel=T)



res_map <- 
  edo@compareClusterResult %>% 
  filter(Description %in% plot$data$name) %>% 
  group_by(Description, Cluster) %>% 
  summarise(p = mean(p.adjust)) %>% 
  ungroup() 

anno <- data.frame(name=unique(res_map$Description), type=map(.x=unique(res_map$Description), .f=function(i){
  c <- res_map %>% filter(Description==i) %>% pull(Cluster)
  return(c[res_map %>% filter(Description==i) %>% pull(p) %>% which.min()] %>% as.character())
  }) %>% unlist())

library(ggrepel)
plot$data <- plot$data %>% left_join(., anno, by="name")


ggplot(data=plot$data)+
  plot$layers[[1]]+
  geom_point(mapping = aes(x=x, y=y, fill=type, size=size), colour="black",pch=21)+
  scale_fill_manual(values=RColorBrewer::brewer.pal(8, "Set3"))+
  scale_size(range=c(1, 2))+
  theme_bw() +
  guides(size="none")+
  xlab("Dim 1")+ylab("Dim 2")+
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_rect(colour = "black", size=0.5),
          axis.text.x = element_text(colour="black", size=0.5), 
          axis.text.y = element_text(colour="black", size=0.5))+
    coord_fixed()+
  geom_text_repel(mapping= aes(x,y, label=name),
                  size=2, force=10, force_pull=10, max.overlaps=20,
                  segment.size=0.1)+
  ggtitle("GeneSet Enrichment Analysis")

We can confirm that the cluster 5 (orange) is highly associated with cellular stress under hypoxia, either by the ChatGPT or the Gene Ontology analysis.

Spatial Cluster Analysis

There are plenty of different algorithm to perform spatial clustering. I will first show the a algorithm called “BayesSpace” which helps to identify transcriptional niches by integration the spatial localisation and gene expression similarity. The clustering can by performed by the following function:

# This will take some time ... take a coffee !
# If you want to speed the process up, the clustering is saved in the object so you dont have to run it...
object <- SPATA2::runBayesSpaceClustering(object)
#Plot:
SPATA2::plotSurface(object, color_by="bayes_space", pt_alpha=0.6, display_image = T)+
  SPATA2::ggpLayerAxesSI(object)+
  ggpLayerThemeCoords(unit = "mm")+
  ggtitle("SNN Cluster")+
  scale_color_manual(values=RColorBrewer::brewer.pal(8, "Set3"))

First we will compare both algorithms to check which cluster are similar and which are different. Therfore we can use mosaike plots which incooperate the number of data points and the distribution at the same time.

data <- SPATA2::joinWithFeatures(object, features=c("seurat_clusters", "bayes_space"))

library(ggmosaic)
ggplot(data = data) +
  geom_mosaic(aes(x = product(seurat_clusters, bayes_space), fill=seurat_clusters)) + 
  scale_fill_manual(values=RColorBrewer::brewer.pal(8, "Set3"))+
  theme_bw() +
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_rect(colour = "black", size=0.5),
          axis.text.x = element_text(colour="black", size=10), 
          axis.text.y = element_text(colour="white", size=10))+
    coord_fixed()+
  ggtitle("Compare SNN and BayesSpace")

The subsequent analysis for cluster characterization will follow the previously explained procedures. Given our expanded understanding of the biological functions, our focus will now shift to gaining deeper insights into cellular distribution across the clusters. To accomplish this, we will employ the method of spot deconvolution.

Spot deconvolution using single cell RNA-seq reference data

Imagine you have a cake with various layers of different flavors. Now, your task is to find out what percentage of each flavor is in a slice of the cake. That’s similar to what we’re doing in spatial transcriptomics with a process called “spot deconvolution”. We’re trying to figure out what types of cells (the ‘flavors’) are present in each spot (the ‘slice’) of a tissue sample (the ‘cake’). To do this, we start with the algorithm RCTD which is one of the most robust algorithm on the market. RCTD, or Reference Component Tissue Decomposition, is a method we use to figure out the ‘recipe’ of our slice. Here’s how it works in simple terms:

  1. Making a Recipe Book: First, we use a set of single cells (which we already know a lot about) to make a ‘recipe book’. This book has the average gene expression (the ‘taste’ of the cell) for each cell type.

  2. Matching Flavors: For each spot, RCTD tries to match the observed gene expression (the ‘taste’ of the spot) to the cell type expressions in our recipe book. It’s like trying to guess the cake recipe by tasting the slice.

  3. Considering All Possibilities: But there might be many different recipes that could give the same taste. So instead of just guessing one recipe, RCTD makes a list of all possible recipes that could explain the taste of our slice.

  4. Sampling Recipes: RCTD then randomly picks from these possible recipes many, many times. This gives us a spread of possible recipes for each spot.

    By considering many possible ‘recipes’ for each spot, RCTD gives us a good idea of the different types of cells that could be in each spot, and how sure we are about these guesses. This is very useful for understanding the complex mix of cells in our tissue samples.

First we need a dataset (reference dataset) with single cells which need to contain defined labels of the cells. Here we use a downsized dataset of the GBMap (comprehansive reference atlas for GBM). The single cell data are normaly stored as seurat objects (more info on seurat: https://satijalab.org/seurat/). The seurat reference object is also uploaded.

ref <- readRDS("~/path/Down_scaled_reference.RDS")
# Colors of the cell types
colors <- readRDS("~/path/colors_cell_deconv.RDS")
col_cells <- colors$colors
names(col_cells) <- colors$annotation_level_4
# Plot the single cell data
f <- DimPlot(ref, group.by  = "annotation_level_4")
library(scattermore)
f$data$annotation_level_4 <- factor(f$data$annotation_level_4, levels = colors$annotation_level_4)
p <- ggplot(data=f$data)+
  geom_scattermore(mapping=aes(UMAP_1,UMAP_2), pointsize = 8, color="black")+
  geom_scattermore(mapping=aes(UMAP_1,UMAP_2), pointsize = 7, color="white")+
  geom_scattermore(mapping=aes(UMAP_1,UMAP_2, color=annotation_level_4), pointsize=4)

p <- p+
    scale_colour_manual(values=col_cells)+
    ylab("UMAP2")+xlab("UMAP1")+
    theme_bw() +
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          legend.text = element_text(size = 5),
          legend.title = element_text(colour="white", size=3),
          panel.background = element_rect(colour = "black", size=0.5),
          axis.text.x = element_text(colour="black"), 
          axis.text.y = element_text(colour="black"))+
    coord_fixed()+
  ggtitle("Reference dataset")+
  guides(color = guide_legend(override.aes = list(size = 4)))+
  theme(legend.key.size = unit(0.5, "cm"))
p

We will use the SPATAWrappers extension to run the RCTD deconvolution:

# First install the package:
options(timeout = 600000000) ### set this to avoid timeout error
devtools::install_github("dmcable/spacexr", build_vignettes = FALSE)

#This step also take some time, so relax... if you want to speed up, the output is saved in the object...

# Run the analysis
object <- SPATAwrappers::runRCTD(object, ref=ref, cell_type_var = "annotation_level_4")

The analysis will output the celltype probabilities in the fdata slot of the spata object. From there we can visualize the cell distributions in the different clusters. For example let’s plot the tumor cell types across clusters:

tumor_celltypes <- 
  colors %>% 
  filter(annotation_level_2 %in% c("Stem-like", "Differentiated-like")) %>% 
  pull(annotation_level_4)

data <- joinWith(object, features = c("bayes_space", tumor_celltypes))

#
data <- 
  data %>% 
  pivot_longer(cols = tumor_celltypes) %>% 
  dplyr::select(c(bayes_space,name,value)) %>% 
  group_by(bayes_space,name) %>% 
  summarise(value=mean(value))


a <- ggplot(data = data) +
  geom_col(aes(x =bayes_space, y=value, fill=name)) + 
  scale_fill_manual(values=col_cells)+
  theme_bw() +
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_rect(colour = "black", size=0.5),
          axis.text.x = element_text(colour="black", size=10), 
          axis.text.y = element_text(colour="black", size=10))+
    coord_fixed(ratio = 10)+
  ggtitle("Cell types in cluster")



T_cells <- 
  colors %>% 
  filter(annotation_level_2 %in% c("Lymphoid")) %>% 
  pull(annotation_level_4)

data <- joinWith(object, features = c("bayes_space", T_cells))

#
data <- 
  data %>% 
  pivot_longer(cols = T_cells) %>% 
  dplyr::select(c(bayes_space,name,value)) %>% 
  group_by(bayes_space,name) %>% 
  summarise(value=mean(value))


b <- ggplot(data = data) +
  geom_col(aes(x =bayes_space, y=value, fill=name)) + 
  scale_fill_manual(values=col_cells)+
  theme_bw() +
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_rect(colour = "black", size=0.5),
          axis.text.x = element_text(colour="black", size=10), 
          axis.text.y = element_text(colour="black", size=10))+
    coord_fixed(ratio = 120)+
  ylim(0,0.05)+
  ggtitle("Cell types in cluster")


a

b

From this data, we can interpretate that in cluster 1 (hypoxia) the most abundant cell type are mesenchymal tumor cells and the T cell fraction appeared significantly lower. The results can also be presented in a surface plot:

col=colorRampPalette(c("#FFFFFF", colors[colors$annotation_level_4=="MES_like_hypoxia_MHC", ]$colors))
a <- 
  SPATA2::plotSurface(object, color_by="MES_like_hypoxia_MHC", pt_alpha=1, display_image = F)+
  scale_colour_gradientn(colours = col(50),limit=c(0,0.5), 
                           oob = scales::squish, na.value = "white", name="Enrichment")+
  guides(color = guide_colourbar(barwidth = 0.3, barheight = 8, ticks =F, frame.colour="black"), label=F)+
  SPATA2::ggpLayerAxesSI(object)+
  SPATA2::ggpLayerTissueOutline(object)+
  ggpLayerThemeCoords(unit = "mm")+
  ggtitle("Mes-Hypoxia")

col=colorRampPalette(c("#FFFFFF", colors[colors$annotation_level_4=="NPC_like_neural", ]$colors))
b <- 
  SPATA2::plotSurface(object, color_by="NPC_like_neural", pt_alpha=1, display_image = F)+
  scale_colour_gradientn(colours = col(50),limit=c(0,0.5), 
                           oob = scales::squish, na.value = "white", name="Enrichment")+
  guides(color = guide_colourbar(barwidth = 0.3, barheight = 8, ticks =F, frame.colour="black"), label=F)+
  SPATA2::ggpLayerAxesSI(object)+
  SPATA2::ggpLayerTissueOutline(object)+
  ggpLayerThemeCoords(unit = "mm")+
  ggtitle("NPC")

a+b